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Abstract 

The detection of change-points in a spatially or time-ordered data sequence is an important problem 
in many fields such as genetics and finance. We derive the asymptotic distribution of a statistic 
recently suggested for detecting change-points. Simulation of its estimated limit distribution leads to 
a new and computationally efficient change-point detection algorithm, which can be used on very long 
signals. We assess the algorithm experimentally under various conditions. 

1. Introduction 

When met with a data set ordered by time or space, it is often important to predict when or 
where something "changed" as we move temporally or spatially through it. In biology, for 
example, changes in an array Comparafive Genomic Hybridizafion (aCGH) or Ghip-Seq dafa 
signal as one moves across the genome can represent an event such as a change in genomic 
copy number, which is extremely important in cancer gene detection IflTlISSl . In the financial 
world, detecting changes in multivariate time-series data is important for decision-making 
|[27l . Change-point detection can also be used to detect financial anomalies |l3l and significant 
changes in a sequence of images IfTTl . 

Change-point detection analysis is a well-studied field and there are numerous approaches 
to the problem. Its extensive literature ranges from parametric methods using log-likelihood 
functions EM to nonparametric ones based on Wilcoxon-type statistics, U-statistics and 
sequential ranks. The reader is referred to the monograph IQ for an in-depth treatment of 
these methods. 

In change-point modeling it is generally supposed that we are dealing with a random 
process evolving in time or space. The aim is to develop a method to search for a point where 
possible changes occur in the mean, variance, distribution, etc. of the process. All in all, this 
comes down to finding ways to decide whether a given signal can be considered homogeneous 
in a statistical (stochastic) sense. 

The present article builds upon an interesting nonparametric change-point detection method 
that was recently proposed by Matteson and James flSl . It uses U-statistics (see 0|) as the 
basis of its change-point test. Its interest lies in its ability to detect quite general types of 
change in distribution. Several theoretical results are presented in ITSlI to highlight some of 
the mathematical foundations of their method. These in turn lead to a simple and useful data- 
driven statistical test for change-point detection. The authors then apply this test successfully 
to simulated and real-world data. 
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There are however several weaknesses in flSl both from theoretical and practical points of 
view. Certain fundamental theoretical considerations are incompletely treated, especially the 
assertion that a limit distribution exists for the important statistic, upon which the rest of the 
approach hangs. On the practical side, the method is computationally prohibitive for signals 
of more than a few thousand points, which is unfortunate because real-world signals can be 
typically much longer. 

Our paper has two main objectives. First, it fills in missing theoretical results in IITSi 
including a derivation of the limit distribution of the statistic. This requires the effective 
application of large sample theory techniques, which were developed to study degenerate 
U-statistics. Second, we provide a method to simulate from an approximate version of the limit 
distribution. This leads to a new computationally efficient strategy for change-point detection 
that can be run on much longer signals. 

The article is structured as follows. In Sectionwe provide some context and present the 
main theoretical results. In Sectionwe show how to approximate the limit distribution of 
the statistic, which leads to a new test strategy for change-point detection. We then show how 
to extend the method to much longer sequences. Simulations are provided in Section IV A 


short discussion follows in Sectionand a proof of the paper's main result is given in Section 
Some important technical results are detailed in the Appendix. 


VI 


II. Theoretical results 


I. Measuring differences between multivariate distributions 

Let us first briefly describe the origins of the nonparametric change-point detection method 
described in il^ . For random variables Y, Z taking values in d > 1, let cpy and (pz denote 
their respective characteristic functions. A measure of the divergence (or "difference") between 
the distributions of Y and Z is as follows: 

D(Y,Z) = / 

u R 


where w{t) is an arbitrary positive weight function for which this integral exists. It turns out 
that for the specific weight function 




l^/12/^F((d + ^)/2)" ) 


which depends on a G (0,2), one can obtain a not immediately obvious but very useful result. 
Let Y, Y' be i.i.d. Fy and Z, Z' be i.i.d. Fz, with Y, Y', Z and Z' mutually independent. Denote 
by I ■ I the Euclidean norm on IR'^. Then, if 

]E(|Y|^ + |Z|^) <<x), (1) 

Theorem 2 of 1251 yields that 

V{Y,Z;^) =£{Y,Z-p,) := 2E |Y - Z|^ - E |Y - Y'|^ - E |Z - Z'|^ >0, (2) 

where we have written T>{Y,Z' (i) instead of D(Y, Z) to highlight dependence on /3. Therefore 
Q implies that £ (Y,Z;/5) G [0,oo). Furthermore, Theorem 2 of l25] says that £ (Y, Z; /5) =0 
if and only if Y and Z have the same distribution. This remarkable result leads to a simple 
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data-driven divergence measure for distributions. Seen in the context of hypothesizing a change- 
point in a signal of independent observations X = (Xj,..., X^) after the fc-th observation Xj^, 
we simply calculate an empirical version of l|^: 

1 

E (3) 

l+k<i<j<n 

Matteson and James m state without proof that under the null hypothesis of Xi,..., X„ being 
i.i.d. (no change-points), the sample divergence given in ||^ scaled by converges in 

distribution to a non-degenerate random variable as long as minJA:, n — k} ^ co. Furthermore, 
they state that if fhere is a change-poinf befween two disfincf i.i.d. disfribufions affer fhe A-fh 
point, the sample divergence scaled by tends a.s. to infinity as long as minJA:, n — k} ^ oo. 

These claims clearly point to a useful sfafisfical fesf for defecfing change-poinfs. However, we 
cannot find rigorous mafhemafical argumenfs to substantiate them in fflSl , nor in the earlier 
work ||25ll . 

As this is of fundamental importance to the theoretical and practical validity of this change- 
point detection method, we shall show the existence of the non-degenerate random variable 
hinted at in llTSl by deriving its distribution. Our approach relies on the asymptotic behavior of 
U-sfafisfic f5T3e processes, which were infroduced for fhe firsf fime for change-poinf defection 
in random sequences in ||^; see also Chapfer 2 of the book ||5j. We also show that in the 
presence of a change-point the correctly-scaled sample divergence indeed tends to infinity with 
probability 1. 

II. Main result 

Let us first begin in a more general setup. Let Xi,..., X„ be independent IR'^-valued random 
variables. For any symmetric measurable function cp : IR'^ x t R, whenever the indices 
make sense we define fhe following four ferms: 

Vk{f)--=E E 

!=1 j=k+l 

Un{cp):= E 9(Xi,Xj), 

l<i<j<n 

l<i<j<k 

(<?):= E <P(X.X^). 

k-\-l<i<j<n 

Ofherwise, define fhe ferm fo be zero; for insfance, 11^ {(p) = 0 and (q>) = 0 for k = n — 1 
and n. Note that in the context of the change-point algorithm we have in mind, cp[x,y) = 
q>^{x,y) := \x — y\^, fi G (0,2), but the following results are valid for the more general cp 
defined above. Nofice also fhaf fhe lasf fhree ferms are U-sfafisfics absent their normalization 
constants. Next, let us define 

^ - (” (I’)- 
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Observe that U^„{q>) is a general version of the empirical divergence given in Notice that 

VAcp) = Un{cp)-U^l\<p)-uf-\q>). ( 4 ) 

While is not a U-statistic, we can use to express it as a linear combination of 

U-sfafisfics. Indeed, we find fhat 


^k,n ( 4 *) 


2 (n-l) / Unjf) 
k{n — k) y n — 1 


U\^\cp) up if) \\ 

k-1 ^ n-k-1jj ■ 


Therefore, we now have an expression for Ui^„{cp) made up of U-statistics, which will be useful 
in fhe following. 

Our aim is to use a test based on (^) for fhe null hypothesis 'Ho : have the 

same distribution, versus the alternative hypothesis Hi that there is a change-point in the 
sequence Xi,..., X„, i.e.. 

Hi : There is a 7 e (0,1) such that P(Xi < t) = ■ ■ ■ = P(X^„.^j < f), 

P(Xl„.^j+i < f) = • • • = P(X„ < f), te 

and P(Xl„.j,j < to) 7 ^ P(Xl„.j,j+i < to) for some to- 

For u,v G P'^, u < V means that each component of u is less than or equal to the corresponding 
component of v. Also nofe fhaf for any z G R, [zj sfands for ifs integer part. 

Let us now examine the asymptotic properties of Uj^„{(p). We shall be using notation, 
methods and results from Secfion 5.5.2 of monograph 1211 fo provide fhe groundwork. In fhe 
following, we shall denofe by F fhe common (unknown) disfribufion funcfion of fhe X, under 
Ho/ X a generic random variable wifh disfribufion funcfion F, and X' an independenf copy of 
X. We assume fhaf 

E<p2(x,x0 = / f cp\x,y)dF{x)dF{y) < <x), (5) 

Jr'' Jr‘> 

and sef 0 = 'Ecp{X,X'). We also denofe (pi{x) = E(p(x, X'), and define 

h{x,y) = (p{x,y) - cpi{x) - (pi{y), h 2 {x,y) = h{x,y) + Q. (6) 

With this notation, we see that EF(X,X') = —0, and therefore that 'Eh 2 {X,X') = 0. Further¬ 
more, 

Ukjcp) = Uk,n{h) = Uk,n{h), (7) 

since 

Uni®) _ ( UP{&) iff (0) \ ^ i^ _ ( UPiiP) UPitp) \ ^ 
n — 1 yk — 1 n —k — 1j n — \ yk — 1 n — k—lj 

where ip{x,y) := (pi{x) + cpiiy)- As in Section 5.5.2 of 1211 . we fhen define fhe operator A on 
L 2 (R‘',F)by 

^g{x) ■■= j^p 2 {x,y)g{y)dF{y), X e R^^ e L 2 (R^F). ( 8 ) 

Let A,, i > 1, be the eigenvalues of this operator A with corresponding orthonormal eigenfunc¬ 
tions (pi, i > 1. Since for all x G R'^, 


J^p2ix,y)dF{y) = 0, 
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we see with (pi := 1, A(pi = 0 =: ^ipi- Thus (0,1) = {Ai,(pi) is an eigenvalue and normalized 
eigenfunction pair of fhe operafor A. This implies fhaf for every eigenvalue and normalized 
eigenfuncfion pair (A,, i > 2, where A, is nonzero, 

EiPiiX)cPi{X))=EcPi{X)=0. 

Moreover, we have fhaf in L 2 (]R'^ x IR'^, F x F), 


From fhis we gef fhaf 


K 

h{x,y) = lim ^A,</)i(x)<^,(y). 

K^oo “ 
i—l 


Ehl{X,X') = J2^l 

1=1 


(9) 


For furfher defails and fheorefical jusfificafion of fhese claims, refer fo Secfion 5.5.2 of 112X1 
and bofh Exercise 44 on pg. 1083 and Exercise 56 on pg. 1087 of |I7|. In facf, we shall assume 
furfher fhaf 


E I'^'l < ~- 
! = 1 


( 10 ) 


If is crucial for fhe change-poinf fesfing procedure fhaf we shall propose fhaf fhe funcfion 
h 2 {x,y) defined as in Jl0| with cp{x,y) = cp^{x,y) = \x — y\^, /3 e (0,2), satisfies ilOi whenever 
holds. A proof of fhls is given in fhe Appendix. 

Nexf, for any fixed ^ < f < 1 — w > 3, sef 




{[nt\ {n- [nt\)y 


U 


lnt\,n 


ih) 


n'^{n — 1 ) 

2[MfJ (n - [nfj) [ Unih) _ /^[n!j(^2) 

n — 1 1 [nfJ — 1 n — [nt] — 1 


( 11 ) 


We define UQ^\h 2 ) = 0, ul^\h 2 ) = U„{h 2 ), U^\h2)/0 = 0, and lT^E(^2)/0 = 0 , which gives 


Y„(/z 2 , t) = 0, for t e 



Yn{h2,t) = 



, for t e 


1 2 
n' n 




A{n-2) ( Unik) ,P 2 ) 


n — 1 


n — 3 


- , for f e 


. 2^1 

1 --, 1 -- 
n n 


¥„(F 2 ,f) 


2 (h- 1 ) 

m 2 



n-2 


for t e 



and Y„{h 2 ,1 ) = 0. 


One can readily check fhaf Y„ {h 2 , ■) G [0,1], fhe space of bounded measurable real-valued 
funcfions defined on [0,1] fhaf are righf-confinuous wifh leff-hand limifs. Nofice fhaf on 
accoimf of we can also wrife Y„(h 2 , ■) = Y„(<p, ■), and we will do so from now on. In fhe 
following fheorem, denofes a sequence of independenf sfandard Brownian bridges. 
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Theorem II.l Whenever X,-, i > 1 are i.i.d. F and cp satisfies (|^ and ■) converges weakly 

in [0,1] to the tied down mean zero continuous process Y defined on [0,1] by 


¥(f) :=^A,- f(l-0- 


i=l 


In particular, 


sup \Yn{cp,t)\ 
tern 


sup \Y{t)\ 
t6[04] 


The proof of fhis fheorem is deferred fo Section VI 


Remark II.l Note that a special case of Theorem 11.1 says that for each t e (0,1), 


{[nt\ {n- [nt\ )y 
-{n-l) 




( 12 ) 


This fixed t result can be derived from part (a) of Theorem 1.1 of S3- iEl point out that convergence in 
distribution of a statistic asymptotically equivalent to the left side o/(|l2]) to a nondegenerate random 
variable should follow from fWI under the null hypothesis of equal distributions in the two sample case 
that they consider. Also see S3- (S3 discuss the consistency of their statistic.) To the best of our 
knowledge, we are the first to identify the limit distribution of the „((p). We should point out here 
that the weak convergence result in Theorem III does not follow from Neuhaus' theorem S3, since his 
result is based on two independent samples, whereas ours concerns one sample. 


As suggested in lUSl , under the following assumption, a convergence with probability 1 
result can be proved for the empirical statistic in ||^. We shall show that this is 

indeed the case. 


Assumption 1 Let Y,-, i > 1, and Z,, i > 1, be independent i.i.d. sequences, respectively Fy and Fz. 
Also let Y,Y' be i.i.d. Fy and 7,,7l be i.i.d. Fz, with Y,Y',Z and Z' mutually independent. Assume 
that for some f> e (0,2), E(|Y|^ + \Z\b) < oo. Choose j e (0,1). For any given n > I/ 7 , let 
Xj = Yi, fori = l,...,[n'y\, and = Z;, for i = 1,... ,n - [nj]. 

Lemma II.l Whenever for a given f> e (0,2) Assumption^holds, with probability 1 we have: 

^L.7j,n(X;/5)^£:(Y,Z;/l). (13) 


The proof of this can be found in the Appendix. Next, let (p{x,y) = \x — yfi, f g (0,2). We see 
that for any 7 G ( 0,1 ) for all large enough n, 


sup |¥„(^,t)| > 

t6[0,l] 


{[ni\ in - [ny]))" 

n^{n — 1 ) 


,n (^' 


where it is understood that Assumption [ijholds. Thus by Lemma II.l under Assumption]^ 
whenever Fy Fz, with probability 1 , 


sup \Y„{cp,t) \ 00 . 

t6[0,l] 


This shows that change-point tests based on the statistic supjgjQ^^j \Y„{(p,t)\, under the sequence 
of alternatives of the type given by Assumption]^ are consistent. This also has great practical 
use when looking for change-points. Intuitively, the k E {1,..., n} that maximizes (]^ would 
be a good candidate for a change-point location. 
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III. From theory to practice 


Theorem |IL1| and the consistency result that follows it lay a firm fheorefical foundafion fo 
jusfify fhe change-poinf mefhod infroduced in IflSlI . For fhe presenf arficle, since we are nof 
aware of a closed form expression for fhe disfribufion funcfion of fhe limif process, we may 
imagine fhat fhis asympfofic result is of limited practical use. Remarkably, it turns out that we 
can efficiently approximate via simulation the distribution of ifs supremum, leading fo a new 
change-poinf defecfion algorifhm wifh similar performance fo 1151 buf much fasfer for longer 
signals. For insfance, finding and fesfing one change-poinf in a signal of lengfh 5 000 fakes 
eighf seconds wifh our mefhod and eighf minufes using ITSl . 

To simulafe fhe process Y we need frue or esfimafed values of fhe A,-. Recall fhaf fhese are 
the eigenvalues of the operator A defined in (|^. Following fl2l . fhe (usually infinite) spectrum 
of A can be consisfenfly approximafed by fhe (finife) specfrum of fhe empirical n x n mafrix 
H„ whose {i,j)-th enfry is given by 

Hn{X„Xj) = ^ {cp{Xi,Xj) - pi{i) - pi{j) + y ^), 

where is fhe vector of row means (excluding fhe diagonal enfry) of mafrix <p(X„ Xj) and rj 
fhe mean of ifs upper-diagonal elemenfs. 

In our experience, fhe A, esfimafed in fhis way fend fo be quite accurafe for even small n. 
We asserf fhis because upon simulafing longer and longer i.i.d. signals, rapid convergence of 
fhe A, is clear. Furfhermore, as there is an exponential drop-off in fheir magnifude, working 
with only a small number (say 20 or 50) of the largest ones appears to be sufficient for obtaining 
good results. We illustrate these claims in Section]^ Let us now present our basic algorithm 
for defecfing and fesfing for one pofenfial change-poinf. 

Algorithm for detecting and testing one change-point 

1. Given signal X\,... ,Xn, n > 4, find fhe 2 < k < n — 2 fhaf maximizes fhe original 
empirical divergence given in |j^ mulfiplied by fhe correcf normalizafion given in ( [TT| ), 

i.e., k^{n — k)^/n^{n — 1), and denote fhe value of fhis maximum t*. 

2. Calculate the m largest (in absolute value) eigenvalues of the matrix H„, where Xj) = 
|X, -X/and^e (0,2). 

3. Simulate R times the m-trimcated version of Y(f) using fhe m eigenvalues from fhe 
previous step. Record fhe R values si,... ,sr of fhe (absolufe) supremum of fhe process 
obfained. 

4. Rejecf fhe null hypofhesis of no disfribufional change (af level a) if < oi, where 

^crit •= R iFis case, we deduce a change-poinf af fhe k af which t* is 

found. Typically, we sef a = 0.05. 

Remark III.l One may imagine extending this approach to the multiple change-point case by simply 
iterating the above algorithm to the left and right of the first-found change-point, and so on. However, 
as soon as we suppose there can be more than one change-point, the assumption that we may have 
Xi,..., Xj- i.i.d., with a different distribution to Xj+j ,... ,Xn i.i.d., is immediately broken. Therefore 
the theory we have presented does not directly follow over to the multiple change-point case. It would 
be interesting to cleanly extend the results to this, but this would require further theory and multiple 
testing developments, which are out of the scope of the present article (for references in this direction, see, 
e.g., m)- 
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The E-divisive algorithm described in Ifl5l follows a similar logic to our approach except that 
is calculated via permutation (see 1191 1. Instead of steps 2 and 3, the order of the n data is 
permuted R times and for the r-th permuted signal, 1 < r < step 1 is performed to obtain 
the absolute maximum Sr- The same step 4 is then used to accept or reject the change-point. 

The permutation approach (E-divisive) of IITSl is effective for short signals. Indeed, HOl 
showed that if one can perform all possible permutations, the method produces a test that 
is level a. However, a signal with n = 10 points already implies more than three million 
permutations, so a Monte Carlo strategy (i.e., subsampling permutations with replacement) 
becomes necessary, typically with R = 499. This also gives a test that is theoretically level a 
(see 1191 ) but with much-diminished power. 

One could propose increasing the value of R but there is an unfortunate computational 
bottleneck in the approach. Usually, one stores in memory the matrix of |X, — Xj\^ in order to 
efficiently permute rows/columns and therefore recalculate t* each time. But for more than 
a few thousand points, manipulating this matrix is slow if not impossible due to memory 
constraints. The only alternative to storing and permuting this matrix is simply to recalculate 
it each time for each permutation, but this is very computationally expensive as n increases. 
Consequently, the E-divisive approach is only useful for signals up to a few thousand points. 

In contrast to this, our algorithm, based on an asymptotic result, risks rmderperforming on 
extremely short signals, and its performance will also depend on our ability to estimate well 
the set of largest A,. In reality though, it works quite well, even on short signals. The matrix 
with entries | X, — Xy | ^ needs only to be stored once in memory, and all standard mathematical 
software (such as Matlab and R) have efficient functions for finding its largest m eigenvalues 
(the eigs frmction in Matlab and the eigs function in the R package rARPACK). Each iteration of 
the algorithm's simulation step requires summing the columns of an m x T matrix of standard 
normal variables, where m is the number of A, retained and T the number of grid points 
over which we approximate the Brownian bridge processes between 0 and 1. Eor m = 50 and 
T = 1000 it takes about one second to perform this R = 499 times, and is independent of the 
number of points in the signal. In contrast, the E-divisive method takes about ten seconds 
for M = 1000, one minute for n = 2 000, eight minutes for n = 5 000, etc. One clearly sees the 
advantage of our approach for longer signals. 

IV. Experimental validation and analysis 
1. Simulated examples 

It is very important to start with the simplest possible case in order to demonstrate the 
fundamental validity of the new method. A basis for comparison is the E-divisive method 
from ITSl . Here, we consider signals of length n e {10,100,1000,10 000} for which either the 
whole signal is i.i.d. 1) or else there is a change-point of height c e {0.1,0.2,0.5,1,2,5} 
after the (M/2)-th point, i.e., the second half of the signal is i.i.d. 

In the former case, we look at the behavior of the T5q)e I error, i.e., the probability of 
detecting a change-point when there was none. We have fixed a = 0.05 and want to see how 
close each method is to this as n increases. In the latter case, we look at the power of the test 
associated to each method, i.e., the probability that an actual change-point is correctly detected 
as n and c increase. We averaged over 1000 trials. In the following, unless otherwise mentioned 
we fix /3 = 1. Eor the asymptotic method, the Brownian bridge processes were simulated 499 
times; similarly, for E-divisive we permuted 499 times. Both null distributions were therefore 
estimated using the same number of repeats. Note that we did not test the E-divisive method 
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for n = 10000 because each of the 1000 trials would have taken around two hours to run. 
All hmes given in this paper are with respect to a laptop with a 2.13 GHz Intel Core 2 Duo 
processor with 4Gb of memory. Results are presented in Figure]^ 



Figure 1: Statistical power of the asymptotic (solid line) and E-divisive (dotted line) methods for detecting change c 
in mean in a Gaussian signal of length n. The first nil. points are i.i.d. A/’(0,1) i^nd the last njT. points 
i.i.d. Af{c,l). The Type I error is also shown (c = 0). Results are averaged over 1 000 trials. 


For the Type 1 error, we see that both methods hover around the intended value of .05, 
except for extremely short signals (n = 10). As for the statistical power, it increases as n and 
c increase. Furthermore, the asymptotic method rapidly reaches a similar performance as 
E-divisive: for n = 10, E-divisive is better (but still with quite poor power), for n = 100 the 
asymptotic method has almost caught up, and somewhere between n = 100 and n = 1000 the 
results become essentially identical; the asymptotic method has a slight edge at n = 1000. 

Let us now see to what extent our method is able to detect changes in variance and tail 
shape. We considered Gaussian signals of length n G {10,100,1000,10000} for which there 
is a change-point after the (n/2)-th point, i.e., the first half of the signal is i.i.d. 1) and 
the second half either i.i.d. Af{0,(T^) for cr^ G {2,5,10} or i.i.d. Student's distributions with 

V G {2,8,16}. Results were averaged over 1000 trials and are shown in Figured 

As before, the statistical power tends to increase as n increases and either ^ increases or 

V decreases. The asymptotic method matches or beats the performance of E-divisive starting 
somewhere between n = 100 and n = 1000. 

Next, we take a look at the performance of the algorithm when the change-point location 
moves closer to the boundary. As an illustrative example, we work with sequences of length 
1000 and either place the change-point after the 100th, 300th or 500th point. Eigure|^ shows 
histograms of 1000 repetitions for the predicted location of the change-point, here a change in 
mean of c = 0.5 (hardest), c = 1 (medium) and c = 2 (easiest). We see that moving towards the 
boundary increases the variance and bias in the prediction. However, as the problem becomes 
easier (bigger jump in mean), both the variance and bias decrease. Similar results are found 
when looking at change in variance and tail distribution. 

II. Algorithm for long signals 

Remember that as it currently stands, the longest signal that we can treat depends on the 
largest matrix that can be stored, which depends in turn on the memory of a given computer 
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Figure 2: Statistical power of the asymptotic method (solid line) and E-divisive method (dotted line) for detecting 
change in variance (left) and tail (right) in a signal of length n. The first njT. points are i.i.d. Af{0, 1) 
and the last nIT. points either i.i.d. J\f{0,cr'^), 6 {2,5,10} (left) or from a Student's ty distribution 

with V degrees of freedom, v € {2,8,16} (right). Residts are averaged over 1 000 trials. 



Figure 3: Detecting change in mean of c = 0.5,1 or 2 located at different distances to the boundary (change-point 
location cp = 100,300,500) in standardized Gaussian signals with 1 000 points. Plots show histograms 
of predicted change-point location over 1 000 trials. 


(memory problems for simply manipulating a matrix on a standard PC typically start to occur 
aroimd n = 10-15000). For this reason, we now propose a modified algorithm that can treat 
vastly longer signals. 
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Long-signal algorithm 

1. Extract sub-signal of equidistant points of lengfh 2 000. 

2. Run the one change-point algorithm on this. If the null hypothesis is rejected, output 
the index k of the predicted change-point in this sub-signal. Otherwise, state that no 
change-point was found. 

3. If a change-point was indeed predicted, get the location k' in the original signal corre¬ 
sponding to k in the sub-signal and repeat step 1 of the one change-point algorithm in the 
interval [k' — z, k' + z] to refine the prediction, where z is user-chosen. If £ is the length 
of the interval between sub-signal points, one possibility is z := min(2£, 1000), where the 
1000 simply ensures this refining step receives a computationally feasible signal length 
of at most 2 000 points. 

We tested this strategy on simulated standard Gaussian signals of length 10^, 10^, 10®, 10® 
and 10^ with one change-point at the midpoint, a jump of 1 in the mean. Figure (left) shows 
the time required to locate the potential change-point. 




Figure 4: Long-signal change-point detection. Left: Computing time for signals with 1 000 to 10 million points. 

Right: Variance in first change-point prediction over 1 000 trials after scaling signals to the interval 

[ 0 , 1 ]. 


Clearly, this is rapid for even extremely long signals. Looking at the algorithm, we see 
that it merely involves finding a change-point twice, once in the sub-signal, then once in a 
contiguous block of the original signal of at most length 2 000. As these two tasks are extremely 
rapid, the increase in computation time seen mostly comes from the computing overhead 
of having to extract the sub-signal from longer and longer vectors in memory. In Figure 
(right), we plot the log signal length against the normalized variance, which means that we 
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calculate the variance in predicted change-point location over 1000 trials after first dividing the 
predictions by the length of the signal. Thus all transformed predictions are in the interval [0,1] 
before their variance is taken. This shows that relative to the length of the signal, subsampling 
does not deteriorate the change-point prediction quality. Instead, what deteriorates due to 
subsampling is the absolute prediction quality, i.e., the variance in predicted change-point 
location does increase as the signal length increases. However, we cannot get around this 
without introducing significantly more sophisticated subsampling procedures, beyond the 
scope of fhe work here. 


V. Discussion 

We have derived the asymptotic distribution of a statistic that was previously used to build 
algorithms for finding change-points in signals. Our new result led to a novel way to construct 
a practical algorithm for general change-point detection in long signals, which came from 
the surprising realization that it was possible to approximately simulate from this quite 
complicated asymptotic distribution. Furthermore, the method appears to have higher power 
(in the statistical sense) than previous methods based on permutation tests for signals of a 
fhousand points or more. We tested the algorithm on several simulated data sets, as well as a 
subsampling variant for dealing wifh extremely long signals. 

An interesting line of fufure research would be fo find ways fo segmenf fhe original signal 
without requiring stocking a matrix in memory with the same number of rows and columns as 
there are points in the signal, currently a bottleneck for our approach and even more so for 
previous permufation approaches. Furthermore, the pertinent choice of fhe power (0,2) 
remains an open quesfion. Lasfly, fheorefically valid and experimentally feasible extensions of 
fhis framework fo fhe mulfiple change-poinf case could be a fruifful line of fufure research. 


VI. 


Proof of Theorem 


II.l 


To prove Theorem II.l we require a useful technical result. Let us begin with some notation. 
For each integer X > 1, let D^[0,1] denote the space of bounded measurable functions defined 
on [0,1] faking values in fhat are right-confinuous with left-hand limits. For each integer 


n > 1, let V®, A: > 1, be a sequence of processes taking values in [0,1] such that for some 
M > 0, uniformly in A: > 1 and n > 1, 


E 


sup 

vPit) 

\t€l0,l] 



< M. 


(14) 


For each infeger X > 1, define the process taking values in D^[0,1] by 

Vn,K = . . 

Assume that for each infeger X > 1, ^n,K converges weakly as n —t oo to the D^[0, l]-valued 
process Vk defined as 

Vk:= (vW.VW), 
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where k > 1, is a sequence of [0,1]-valued processes such that for some M > 0, 

uniformly in A: > 1, 


E 


sup 


vW(f) 


< M. 


(15) 


We shall establish the following useful result. 


Proposition VI.l With the notation and assumptions introduced above, for any choice of constants 
Urn, m >1, satisfying |flm| < oo, the sequence of D^[0,l]-valued processes 

T„ := f; 

m—1 

converges weakly in [0,1] to the [0, l]-valued process 


m—1 


Proof. Notice that by •ED 


1 E 

sup 

vir^t) 

\m—l 

t6[0,l] 



oo 

< M ^ |flm| < OO- 

m—1 


From this we get that with probability 1, for each n >1, 


m=l 


I 

t€[0,l] 




< OO, 


which in turn implies that with probability 1, for each n > 1, 


lim sup 


rfV) 


= 0 , 


(16) 


where 

00 

TfV):= E 

m=K+l 

Since for each n > 1 and X > 1, e D^[0,1], where := by completeness 

of [0,1] in the supremum metric (see page 150 of monograph [Jj), we infer that Tn e Dl[0,l]. 
In the same way we get using (15) that 


lim sup 




= 0 , 


(17) 


where 

OO 

¥^\t):= ^ fl,„vW(t), 

m—K-\-l 

and thus that T E [0,1]. Also, since by assumption for each integer X > 1, V„ ^ converges 

weakly as n —)• oo to the D^[0, l]-valued process Vjc, we get that converges weakly in 
[0,1] to where 


Tf ’ := E and := E AmV)"*). 


Urn 


m—1 


m—1 
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We complete the proof by combining this with ( [T^ and \V7\ , and then appealing to Theorem 
4.2 of 10. □ 


We are now ready to prove Theorem 11.1 It turns out that it is more convenient to prove the 
result for fhe following version of fhe process namely 

(fc, - (fe), 

which is readily shown fo be asympfofically equivalenf fo Y„{h 2 , t). Following pages 196-197 
of II 2 T] , we see fhat 


lUnih) 


” k=l 


- 

k=l 




\i=l 


i=l 


■ Yh ^k^k,n' 


k=l 


'\nt\ \ ^ [nt\ 

Y:<pk{Xi)/Vn) --YY^Plm 

, 1=1 / ” 1=1 


=: YY 

k=l 


and 


2 Up’ (h2) 

-- = hi'^k 


k=l 


Y: MX^)/Vn\ -It 

i=l+[nt\ f 1 = 1 +[ntj 


k=\ 


Thus, 


^ E Ad ) - E A.vd(o. 

(18) 


k=l \ “ “ “ / J:=l 

Lef {W «},>1 be a sequence of sfandard Wiener processes on [0,1]. Wrife 


¥(f) := £;AfcV«(f), 

k=\ 


where, for k>\, 


- t 


vW(f) = f(l - t) ^wW(l)) - ij - (1 - f) ^wW(f)^ 

-t ^(wW(l) - wW(f))^ - (1 -f)^ 

= f(i-f) + 1 ) -(i-f) - f (wW(i) - (19) 


A simple app licafion of Doob's inequalify shows fhaf fhere exisfs a consfanf M > 0 such fhaf 
1 14 1 and 115 1 hold, for and defined as in 1181 and 119 1 . 

For any infeger K> 1, lef Ui be fhe random vector such fhaf = {(pi (Xi),..., (p^iXi)). 
We see fhaf ]E(Ui) = 0 and ]E(lLJilLJ^) = Ik- For any n > 1 lef Ui,.. .,U„ be i.i.d. Ui. 
Consider fhe process defined on D^[0,1] by 

W„,K(f) := I ^ .„-l/2 ^ ,/,j^(Xy) I =: (Wppf). w|,^V)) , 

\ /<L”fJ /<LntJ / 
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where for any i >1, 


W«(f) E 

i< L«tJ 

Notice that as processes in t e [0,1], 

W„,j,(t) n-i/2 Uj- 

i< L”fJ 

Clearly by Donsker's theorem the process (W„ j<;(t))o<t<i converges weakly as n —> oo to 
the IR^-valued Wiener process (Wj^(t))o<f<i, with mean vector zero and covariance matrix 
(ti A t 2 )lK, h,h e [0,1], where 

WK{t) := (wW(t),...,w(^)(t)) . 

Using this fact along with the law of large numbers one readily verifies that for each integer 
K > 1, converges weakly as n —)• oo to ), where and 'V’(') 

are defined as in | |18^ and All the conditions for Proposition |V1.1| to hold have been 
verified. Thus the proof of Theorem 11.1 is complete, after we note that a little algebra shows 
that Y(t) is equal to 

E A, (^t(l - t) - (w«(t) - twW(l))') = E A, (^t(l - t) - (BW(t))') , 

where BW(t) = wW(t) - twW(l), i > 1, are independent Brownian bridges. □ 

VII. Appendix 


I. Proof of Lemma 


III 


Notice that for each n > 1, f „(X; fi) is equal to the statistic in (j^ with k = \_n'y\. By the 
law of large numbers for U-statistics (see Theorem 1 of Il20l l for any 7 £ (0, 1), with probability 
1 , 


[nj] 

2 


-1 


E 

l<i<;< [nyl 


■nfi 


and 


n — lnj\ 

2 


Yy - Y; r ^ E y - Y' 


jZy-z.-r ^Ejz-z'l 


E 

'y<i<j<n—[n'Y\ 

Next for any M > 0, write 

|i/ — z|^ = jy — z|^ 1 {|i/| < M, jzj < M} + \y — z\^l{\y\ < M, jzj > M} 

+ jy - z|^ 1 {|y| > M, jzj < M} + |y - zf 1 {jyj > M, jzj > M} . 

Applying the strong law of large numbers for generalized U-statistics given in Theorem 1 of 
], we get for any M > 0, with probability 1, 

lny\ n— [ 117 ] 

E E |L-Z;i^l{|L| <M,|Zy| <M} 


[ny\ {n - \n^\) 

2E (|Y- Z|^1{|Y| < M, jZj < M}^ . 


15 








Also observe that 


2 [n7jn-[n7j 

2 [n7jn-[n7j 

n— [n 7 j 


< 


n — 


^ E ('«+|Z,l)'l{|Z/l>M}^ 


By the usual law of large numbers for each M > 0, wifh probabilify 1, 

^ n-ln- 

_^_ y 

n - [M7j 


f7—[f77j 

y (M+|Zy|)^l{|Zy| >M} ^2E ((M+|Z|)/^1{|Z| > M}) 


< 


2/^+1e (|Z|^1{|Z| > M}) 


Thus, wifh probabilify 1, for all M > 0, 


lim sup 


[n 7 j n— [nyl 


1«7J(»-M)S ,S |r.-z,ri{|r,|<M,|z,|>M} 


< 2l^+^E (|Z|^1{|Z| > M}) . 

In fhe same way we gef fhaf, wifh probabilify 1, 

L.7J in- L„,J) I'"I"' !"■ -"/Mir.l >M,|Z,| < M} 

< 2E ((|Y| +M)^1{|Y| > M}) < (|Y|^1{|Y| > M}) . 

Finally, nofe fhaf, by fhe Cr-inequalify, 

2 ^1/3. 


[nj\ in — [nqj) 


[n 7 j n— [f77j 

E E |y,-Z/l{|Y,| >M,|Zy| >M} 


< 


< 


i=l j=l 

2 ^ [nj\n-[nj\ 


E (|r/ + |z/)i{7l>M,|z,|>M} 

2/5 L^J __ 2/5 ”"^'5'/ 


y |Y,f 1{|Y,|>M} + 


E |Z/1{|Z;|>M}. 


[nj\ p -5 ’ n-[ny ^ 

By fhe law of large numbers fhis converges, wifh probabilify 1, fo 

2/^E (|Y|/^1{|Y| > M}) +2/^E (^\zfl{\Z\ > M}) . 
Obviously as M —>• oo, 

2E (^\Y-Zfl{\Y\ < M, |Z| < M}) ^2E\Y-Zf 
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and 


3-2/^e(|Y|/^ 1{|Y| >M}j + 3 ■ 2^E (|Z|/^ 1 {|Z| > M} j ^0. 

Putting everything together we get that 1131 holds. □ 

II. A technical result 

Let X and X' be i.i.d. F and let tp be a symmetric measurable fimction from x ^ M 
such that E^^(X, X') < oo. Recall the notation |j^. Let A be the operator defined on L 2 (JR‘^,F) 
as in ||^. 

Notice that 

E(g(X)h 2 (X,X')g(X')) = J^^g{x)Ag{x)dF{x) =: {g,Ag). 


Let us now Introduce some useful definitions. Given /3 e (0,2) and (p^{x,y) = |x — y\^, 
define as in ||^, 

h2,fi{x,y) = q>fi{x,y) - cpyfii^) “ and Hii{x,y) = F^(x,i/) + E^^(X,X'). 


II.l 


as 


( 20 ) 


The aim here is to verify that the function h 2 ^p{x,y) satisfies the conditions of Theorem 
long as 

E\X\^^ < 00 . 

Let A^ denote the integral operator 

^lig{x) = h2,ii{x,y)g{y)dF{y), xeM.'^, ge L2(R^F). 

Clearly 1201 implies l|^ with <p = cp^, which, in turn, by implies 

E/i2^(X,X') = J^^hli^{x,y)dF{x)dF{y) = < oo, 

where A,, i > 1, are the eigenvalues of the operator Ap, with corresponding orthonormal 
eigenfunctions (pi, i > 1. 

Next we shall prove that when | |20^ holds then the eigenvalues A,, i > 1, of this integral 
operator Ap satisfy ROl. This is summarized in the following lemma, whose proof is postponed 
to the next paragrapET 


Lemma VII.l Whenever for some f e (0,2), (20' holds, the eigenvalues Aj, i > 1, of the operator Ap 
satisfy ([I^. 


The technical results that follow will imply that A; < 0 for all i > 1 and is finite, 

from which we can infer 1101, and thus Lemma [VII.l Let us begin with two definitions. 

Definition VII.l Let X be a nonempty set. A symmetric function X : A x A —> R is called positive 
definite if 

n n 

J2J2ciCjK{xi,Xj) > 0 
i=ii=i 

for alln > 1, ci,...,c„ e R and xi,...,x„ e X. 
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Definition VII.2 Let X he a nonempty set. A symmetric function K : X x X —> ]R is called 
conditionally negative definite if 

n n 

'£^'£^CiCjK{xi,Xj) < 0 
z=l;=l 

for alln > 1, ci,..., c„ e IR such that Y 4 =\ Ci = 0 and xi,..., G X. 

Next, we shall be using part of Lemma 2.1 on page 74 of IIJ, which we sfafe here for 
convenience as Lemma IVIL2I 


Lemma VII.2 Let K be a symmetric function on X x X. Then, for any xq g X, the function 

K{x,y) = K{x,xq) +K{y,xo) -K{x,y) -K{xq,xq) 

is positive definite if and only ifK is conditionally negative definite. 

The following lemma can be proved jusf as Corollary 2.1 in ||8l . 

Lemma VII.3 L et H : x ^ 'K be a symmetric positive definite function in the sense of 

Definition VII.l Assume that H is continuous and 1EH^(X, X') < oo, where X and X' are i.i.d. F. 
Then R{g{X)H{X,X')g{X')) > 0/or all g G L 2 (]R^, F), i.e., H is L^-positive definite in the sense of 

We recall fhaf an operafor L on L2(]R'^, F) is called positive definife if for all g G L2(]R'^, F), 
{g^^g)>0- 

Proposition VII.l Let cp : x ^ R be a sy mmetr ic continuous function that is a conditionally 

negative definite function in the sense of Definition V11.2 Assume that q>{x,x) = Ofor all x G IR'^ and 


R(p^{X,X') < 00 . Then cp defines a positive definite operator L on L 2 (]R“, F) given by 
^gix) = - [ ,Hx^y)g{yW{y), x eR‘^,g e L2 {r‘^,f), 
where h is defined as in ([^. Furthermore the operator L on L 2 {R‘^,F) given by 

Lg{^) = -/^, {hix,y) +EcpiX,X')) g{y)dF{y), x G R^,g G L2(1R^F), 

is also a positive definite operator on L 2 (]R^,F). 

Proof. We must show that for all g G L 2 {R‘^, F), 

{g,Lg) = -TE{g{X)h{X,X')g{X')) > 0. 

For any u G IR'^, lef us wrife 

cp{x,y,u) := (p{x,u) + cp{y,u) - (p{u,u) - cp{x,y) 

= (p{x,u) + cp{y,u) - Cp{x,y). 


Since (p is assumed fo be conditionally negafive definife, by Lemma VH^ we have fhaf for any 
fixed u G IR"^, q>{x,y, m) is posifive definife in fhe sense of Definifion VII.l Hence, since q) is 


also assumed fo be continuous, by Lemma VIL3 for all g G L 2 (]R“, F), 


]E(^(X)<p(X,X',w)^(X0) >0. 
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Noting that if U has distribution function F, E<p(x,i/, U) = —h[x,y), we get, assuming that X, 
X' and U are independent, that 

E {g{X)cp{X,X',U)g{X')) = -E {g{X)h{X,X')g{X')) > 0. 

Next, notice that for any eigenvalue and normalized eigenfunction (A,, pair, i > 1, of the 
operator L, we have 

= L^i{x) = - (F(x,i/) +E<p(X,X')) {y)dF{y). 

Now, 

[ {h{x,y) + E(p{X,X')) dF{y) = 0, for all x e R^, 

implies fhaf (Ai,^i) := (0,1) is an eigenvalue and normalized eigenfimcfion pair of L. From 
fhis we gef fhaf whenever A, 7 ^ 0, E(^, (X) = 0, i > 2, which says fhaf for such A,, 

Xi^iix) = - J^^h{x,y)^i{y)dF{y). 

This implies that whenever for some i > 1, (A,,(^,), with A, 7 ^ 0, is an eigenvalue and 
normalized eigenfunction pair of the operator L, it is also an eigenvalue and normalized 
eigenfunction pair of the operator L. Moreover, since the integral operator L is positive definite 
on L 2 (R'^, F), this implies that for any such nonzero A, (where necessarily i > 2) 

- + lE(p(X,X')) ^i{y)dF{x)dF{y) 

= “ / . f = Xi > o, 

which says fhaf fhe operator L is posifive definite on L 2 (R‘^, F). □ 


III. Proof of Lemma 


VII.l 


A special case of Theorem 3.2.2 in [[H says that the fimction cp^{x,y) = \x — y\^, /3 e (0,2), 
is conditionally negative definite. Also see Exercise 3.2.13b in ITJ and the discussion after 
Proposition 3 in Il26l . Therefore by Proposition | VII. 1 1 the integral operator defined by the 
function 

i^^y) = -^2,/3 {x,y) 

is positive definite as well as the integral operator defined by fhe funcfion 

{x,y) = -h 2 ,f,{x,y) - E<p^(X,X'). 

Nexf, as in fhe proof of Proposition VII. 1[ any eigenvalue and normalized eigenfunction 

(A;, l^,') ( A,', (pi') 


pair, with A, 7 ^ 0, i > 1, of the operator is also an eigenvalue and normalized 

eigenfunction pair of fhe operafor L^. 

We shall apply Theorem 2 of 1231 to show fhaf uniformly on compacf subsefs D of R"^, 


00 


Kfi{x,y) = 

i=l 


{x,y) e D X D, 
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where pi > 0, i > 1, are the eigenvalues of the operator with corresponding 

normalized eigenfunctions ipi, f > 1. In particular 

CO 

{x,x) = X eD, 

and thus since Ei/’?(X) = 1, i > 1, and EfC^ we get 

00 

J2pi < ~- 

i=l 

Therefore since, as pointed out above, the eigenvalue and normalized eigenfunction pairs 
(—A„ —(pi) of Lp = —Ap, with A, ^ 0, are also eigenvalue and normalized eigenfunction pairs 
of the operator this implies that |A, | < oo. 

Our proof will be complete once we have checked that satisfies the conditions of Theorem 
2 of II23II . 

Since the function (pa{x,y) = \x — y\^, j5 e (0,2), is conditionally negative definite, by 


for any fixed m G K the function 


Lemma 


V1L2 


the funchon {x,y) is posihve definite. To see this note that by Lemma 


V1L2 


cpp {x,u) + (Pfi {u, y) - cpp {x, y) -cpp{u,u) = cpp {x,u) + cpp {u, y) - cpp (x, y) 
is positive definite. Therefore we readily see that 

Kfi{x,y) = fp (x,u) + cpp {u,y) - cpp (x,y)^ dF (u) 

is positive definite. In addition, Kp {x,y) is symmetric and continuous, and thus Kp {x,y) is a 
Mercer kernel in the terminology of 1231 . We must also verify the following assumptions. 
Assumption A. For each x G Kp (x, ■) G L 2 (]R‘^, F). 

Assumption B. Lp is a boimded and positive definite operator on L 2 (]R‘^, F) and for every 
g G L 2 (]R‘*, F), the function 

^pg{x) = j^^Kp{x,y)g{y)dF{y) 

is a continuous function on 


Assumption C. Lp has at most countably many positive eigenvalues and orthonormal eigen¬ 
functions. 

Since (pp is a symmetric continuous function that is conditionally negative definite in the 
1 satisfying q>p{x,x) = 0 for all x G R^ and E^^(X, X') < oo, we get 
that Lp is a positive definite operator on L 2 (R‘^, F). Also (201 obviously 


sense of Definition 


Vll.l 


by Proposition 
implies that Assumption A holds and 


EKj{X,r) = I^J^^Kj{x,y)dF{x)dF{y) 


< 00 , 

which by Proposition 1 of 1231 implies that the operator Lp is bounded and compact. (From 
Sun's Proposition 1 one can also infer that Lp is positive definite. However, he does not provide 
a proof. Therefore we invoke our Lemma [VlL3| here.) An elementary argument based on the 
dominated convergence theorem implies that Lpg{x) is a continuous function on R"^. Thus 
Assumption B is satisfied. Finally, since the operator Lp is compact. Theorem V1L4.5 of |lZl 
implies that Assumption C is fulfilled. Thus the assumptions of Theorem 2 of l23l hold. This 
completes the proof of Lemma Vll.l □ 
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